ABSTRACT

In an adaptive test, the test administrator chooses test items
sequentially during the test, in such a way as to adapt test
difficulty to examinee ability as shown during testing. An
effectively designed adaptive test can resolve the dilemma inherent
in conventional test design. By tailoring tests to individuals, the
adaptive test can approximately achieve the high point precision of a
peaked test and can extend that high level of precision over the wide
range of a uniform test. As a result, a well-constructed adaptive
test should be more broadly applicable than a conventional test of
comparable item quality and test length, since its precision
characteristics make it useful for classification about one or many
cutting points, as well as for measurement over a wide range. This
paper defines adaptive mental testing in relation to conventional
mental testing, outlines the major research issues in adaptive mental
testing, and reviews the state of the art for each of the research
issues. The research issues are: (1) psychometric theory; (2) design
of adaptive tests; (3) scoring adaptive tests; (4) the testing
medium; (5) item pool development; and (6) advances in measurement
technology. (Author/RL)
The [name illegible] Technical Area of the Army Research Institute
for the Behavioral and Social Sciences (ARI) is concerned with
developing more effective techniques for measuring people's abilities
to aid in Army job assignment. An emerging technology which offers
considerable promise in this area is computer-based adaptive mental
testing. This report was prepared under Army Project 2016J717A76e,
Manpower Systems Technology, to identify technology gaps and
deficiencies and to summarize new trends in the state of the art of
mental testing.

This report was prepared while the author was a staff member of
ARI. He is presently on the staff of the Naval Personnel Research &
Development Center, San Diego, Calif.
ADAPTIVE MENTAL TESTING: THE STATE OF THE ART



Requirement:

To identify technology gaps and deficiencies and to summarize new
trends in the state of the art of mental testing.



Adaptive mental testing is defined in relation to conventional
mental testing. The state of the art is assessed for each of six
research issues in adaptive mental testing: (1) psychometric theory;
(2) design of adaptive tests; (3) scoring adaptive tests; (4) the
testing medium; (5) item pool development; and (6) advances in
measurement technology.

Findings:

Specific research requirements are identified for each research
issue in adaptive mental testing. Discussion of these requirements
is also provided.

Utilization of Findings:

This research forms a basis for designing a research and development
program for application of adaptive mental testing technology to
military applicant selection and job assignment.
The measurement of psychological traits is usually accomplished by
observing the responses of examinees to selected test items. For some
traits, notably the ability-aptitude traits assessed during personnel
selection and classification, all examinees are required to answer a
common set of items, and the test score is a linear composite of the
dichotomous item scores. This test score is used as an index of
individual differences to differentiate among the examinees tested.

It has long been known that administering the same test items to all
examinees is less than optimal, and that the ability to differentiate
accurately among examinees of varying ability could be enhanced by
individually tailoring the test items to the status of the examinee. In
ability measurement terms, this connotes dynamically tailoring test item
difficulty to the ability level of the individual. A test that proceeds
in this fashion is called an adaptive, or tailored, test (Weiss & Betz,
1973; Wood, 1973). Adaptive tests have striking psychometric advantages
over conventional tests under certain circumstances, and they have
aroused considerable interest among test theoreticians.

The development of adaptive testing has been motivated largely by
recognizing that conventional group ability tests do not measure
individual differences with equal precision at all levels of ability;
this is because accuracy and precision of measurement are in part a
function of the appropriateness of test item difficulty to the ability
of the individual being measured.

To measure with high precision at all levels of ability requires
tailoring the test (by either item difficulty or test length, or both)
to the individual. Since ability is unknown at the outset of testing,
the tailoring process must be done during the test, hence the
requirement for adaptive ability testing. This is done by choosing test
items sequentially, during the test, to adapt the test to the examinee's
ability as shown by responses to earlier test items. This can be done
by a human examiner, using paper-and-pencil tests with special
instructions, or by means of a mechanical testing device. The device
most commonly used is an interactive computer terminal.
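
The sequential choice of items described above can be illustrated with a minimal sketch. It is not a method taken from this report: it assumes a simple up-and-down branching rule in which the next item is the unused item whose difficulty is nearest a target, and the target moves up after a correct response and down after an incorrect one. The names `adaptive_test`, `pool`, `respond`, `start`, and `step` are hypothetical.

```python
def adaptive_test(pool, respond, n_items, start=0.0, step=0.5):
    """Administer up to n_items from pool (a list of item difficulties).

    respond(difficulty) returns True for a correct answer. After each
    response the target difficulty moves up (correct) or down (incorrect).
    """
    remaining = list(pool)
    target = start
    record = []
    for _ in range(min(n_items, len(remaining))):
        # Choose the unused item whose difficulty is closest to the target.
        item = min(remaining, key=lambda d: abs(d - target))
        remaining.remove(item)
        correct = respond(item)
        record.append((item, correct))
        # Adapt: a harder item after a success, an easier one after a failure.
        target = item + step if correct else item - step
    return record


# Hypothetical examinee who answers correctly whenever difficulty <= 1.0.
pool = [x / 2 for x in range(-4, 5)]          # difficulties -2.0 ... 2.0
for difficulty, correct in adaptive_test(pool, lambda d: d <= 1.0, n_items=4):
    print(difficulty, correct)
```

In this sketch the administered difficulties climb until the examinee first fails, so the test concentrates its items near the level at which the examinee's performance changes.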

The motivation for adaptive testing is that it should permit
measuring ability with higher and more equal precision throughout a wide
ability range than can conventional group tests in which all persons
answer the same test items. In terms of classical psychometric indices,
improved measurement in that sense should be accompanied by
corresponding improvements in reliability and in external validity. In
addition to the psychometric benefits, there are potential psychological
benefits to examinees, such as reduced frustration, that result from
tailoring test difficulty to the individual.

The rationale behind adaptive testing has existed for years. The
Binet intelligence test was an adaptive test, administered personally
by a skilled examiner. Mass testing using adaptive methods would make
personal administration impractical, however. The development of
adaptive testing awaited the availability of testing media that would
permit widespread use of adaptive tests on a fairly large scale. A
number of problems, psychometric and technological, had to be solved
before adaptive testing could be practical on a large scale. This paper
contains a review of some of those problems, and a summary of the state
of the art in research addressing them.



. t : aia i b, a-: t be Star; 

Conventional group-administered tests of psychological variables,
such as mental abilities, involve administering a common set of items
to all examinees. The total score on such tests, usually the number
correct or some transformation thereof, is used to index individual
differences on the variable being measured. This procedure has been
sanctified by longstanding practice and by empirical usefulness, but it
has disadvantages as a measurement technique.

To construct a conventional test, the test designer chooses some
subset of items from a larger pool of available items known to measure
the variable of interest. Since the items in the pool typically vary
in their psychometric properties, particularly in their difficulty, the
test designer must decide what configuration of these item psychometric
properties best suits the test's purpose. There are two extreme
rationales to guide that decision. One rationale is to choose items that
are highly homogeneous in item difficulty. A test so constructed, called
a "peaked" test, will discriminate very effectively over a narrow range
of the variable, but will discriminate poorly outside that range. The
purpose of a peaked test design is to make fine discriminations in the
vicinity of a cutting point; e.g., to categorize examinees into "go"
and "no-go" groups for selection purposes.

At the opposite extreme is the "uniform" test, constructed of items
that are heterogeneous in difficulty, with item difficulty parameters
spread over a wide range. A uniform test will discriminate with more
or less equivalent precision over a wide range of the variable, but
(other things being equal) the level of precision will be substantially
lower than that of the peaked test at the latter's best point. The
purpose of a uniform test is to measure with equal precision throughout
a wide range of the trait; e.g., to obtain information with which to aid
assignment decisions to jobs requiring varying amounts of the tested
ability.
In designing a conventional test of fixed length, a test designer must
choose between high precision over a very narrow range and low to
moderate precision over a wide range. A compromise between the two is
unsatisfactory unless the test is very long. The alternative is to
tailor the test to the individual examinee, which is the rationale of
adaptive testing.

In an adaptive test, the test administrator chooses test items
sequentially during the test, in such a way as to adapt test difficulty
to examinee ability as shown during testing. An effectively designed
adaptive test can resolve the dilemma inherent in conventional test
design. By tailoring tests to individuals, the adaptive test can
approximately achieve the high point precision of a peaked test and can
extend that high level of precision over the wide range of a uniform
test. As a result, a well-constructed adaptive test should be more
broadly applicable than a conventional test of comparable item quality
and test length, since its precision characteristics make it useful for
classification about one or many cutting points, as well as for
measurement over a wide range.

It is important to understand how an adaptive test can achieve
advantages over conventional tests. It can be shown that measurement
error is a function of the disparity between item difficulty and
personal ability, as well as the discriminating power of the test items
and their susceptibility to guessing. Since a peaked test concentrates
item difficulty at a single ability level, measurement error will be
smallest at that critical level, and increasingly larger at ability
levels deviant from the critical point. In the case of a uniform test,
item difficulty is spread over a wide range; consequently, measurement
error tends to be low to moderate and fairly constant over a
correspondingly wide range.
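
The contrast between peaked and uniform tests can be made concrete with a small numerical sketch. It rests on assumptions that are not part of the discussion above: a two-parameter logistic response model without guessing, for which the information contributed by one item is a^2 P (1 - P), and the approximation that the standard error of measurement is the reciprocal square root of the summed item information. The two 20-item tests and the names `item_information` and `standard_error` are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Information of one item at ability theta (two-parameter logistic)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def standard_error(theta, difficulties, a=1.0):
    """Approximate standard error of measurement at theta for a whole test."""
    info = sum(item_information(theta, a, b) for b in difficulties)
    return 1.0 / math.sqrt(info)


# Two hypothetical 20-item tests with equally discriminating items (a = 1).
peaked = [0.0] * 20                                  # all difficulties at one point
uniform = [-3.0 + 6.0 * i / 19 for i in range(20)]   # difficulties spread evenly

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"{theta:5.1f}  peaked SE = {standard_error(theta, peaked):.2f}"
          f"  uniform SE = {standard_error(theta, uniform):.2f}")
```

Under these assumptions the peaked test has the smaller standard error at its critical point and the larger one far from it, while the uniform test's standard error changes much less across the range.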

What is desirable, of course, is to achieve small measurement
error over a wide range of the trait scale. This can be done only by
administering items of appropriate difficulty at every ability level
of interest. The rationale of adaptive testing is to do this more
efficiently (i.e., in fewer items) than can be done by conventional
means. This implies individualized choice of test items for each
examinee. Administratively, this can be accomplished (a) by individual
testing by skilled examiners, (b) by specially designed
group-administered paper-and-pencil adaptive tests with rather complex
instructions,¹ or (c) by automated testing using a computer or a
specialized stimulus programmer to choose and administer test items.
Research in adaptive testing has emphasized computer-controlled test
administration.

¹ An example of this kind of test is the flexilevel test devised by
Lord (1971).


Early research pertinent to adaptive testing was reviewed by
Weiss and Betz (1973), and by Wood (1973). Subsequent research has
been reviewed by this writer (McBride, 1976a). Research in adaptive
testing has progressed from exploratory studies of item branching
tests (e.g., Seeley, Morton, & Anderson, 1962), through the explication
of a novel test theory applicable to tailored tests (e.g., Lord,
1970, 1974a), to the verge of operational implementation of a
large-scale adaptive testing system for personnel selection (Urry,
1977b).

From a psychometric viewpoint, adaptive tests are attractive for
a number of reasons. Adaptive tests represent a breakthrough in the
technology of psychological measurement, because they can yield more
precise measurement over a wider range with substantially fewer items
than can conventional tests. In other words, adaptive tests can achieve
higher validity of measurement than comparable conventional tests in a
given test length; or, they can attain a given level of validity in
substantially fewer items than a comparable conventional test (Urry,
1974).

Other aspects of adaptive tests also make them attractive,
particularly if they are computer-administered. Tailoring test
difficulty to examinee ability may reduce error variance caused by
examinee frustration, boredom, or test anxiety (Weiss, 1974), as well as
by guessing. Computer administration and scoring can reduce human error
in marking answers, scoring the tests, and recording the results. Test
compromise can be reduced substantially, by eliminating test booklets
(thus negating theft) and by individualizing test construction (thereby
thwarting the use of cheating devices). Printing, storage, and handling
of test booklets and answer sheets can be eliminated, saving costs.

The psychometric and practical potential of adaptive testing makes
it worthy of research and development in the military manpower setting,
with the goal of eventual implementation of an automated system for
test administration and scoring, and personnel selection,
classification, and job-choice counseling. Some of the relevant research
has already been done and has been reviewed as cited above. One outcome
of the completed research has been the crystallization of a number of
research issues that need to be resolved before deciding whether to
implement an adaptive testing system. The purpose of this report is
to present some of those issues and to evaluate the state of the art
with respect to their resolution.
RESEARCH ISSUES



Psychometric Theory

Early adaptive testing research showed that traditional test
theory was an inadequate basis for the construction and scoring of
adaptive tests (e.g., Bayroff & Seeley, 1967). This was due to
requirements for item parameters that were invariant with respect to
examinee group, and for means of scoring tests in which different
examinees answered sets of items that differed in difficulty, number,
and other respects as well. One resolution of this issue was provided by
the earlier development of item response theory (Rasch, 1960; Lord,
1952, 1970, 1974a; Birnbaum, 1968), which provided the needed invariance
properties for item parameters and test scoring capabilities.

Subsequent approaches to adaptive testing were developed that 
did not depend on the rather strong assumptions of item response theory.
Kalisch (1974) and Cliff (1976) both presented theory and methods for 
adaptive testing that are not based on the stochastic response models 
of item response theory. Other psychometric bases appropriate for use 
in adaptive testing may be forthcoming. Clearly, one research issue 
to be addressed is the adequacy of the psychometric foundation of any 
proposed approach to the implementation of adaptive testing. 

Item Response Models

Most adaptive testing research since 1968 has used item response
theory (item characteristic curve, or latent trait, theory) as a
psychometric basis. Within item response theory, several competing
response models for dichotomously scored items have been proposed. These
models differ in mathematical form and in the number of parameters
needed to account for item response behavior. Some of these models
include the one-parameter Rasch logistic model (e.g., Wright & Douglas,
1975); the two-parameter normal ogive model (Lord & Novick, 1968); and
the three-parameter logistic ogive model (Birnbaum, 1968). These models
differ in mathematical complexity and in the procedures required to
implement them in practice. If adaptive testing research is to be based
on item response theory, a consequent research issue is to choose from
among the available response models the one best for the purpose. The
basis for such a choice should include consideration of the
appropriateness of the competing models, their robustness under
violations of relevant assumptions, and the difficulty and expense of
implementing them.

Design of Adaptive Tests

Strategies for Adaptive Testing 

Adaptive testing by definition involves sequential selection of
the test items to be answered by each examinee. Numerous methods for
sequentially choosing items have been proposed. These methods, called
"strategies" for adaptive testing, were reviewed by Weiss (1974). Since
then, several new ones have come forth (e.g., Cliff, 1976; Kalisch,
1974; McBride, 1976b).

These strategies vary along a number of dimensions, including
mathematical elegance, item selection algorithms, scoring methods, and
others. There is a clear need for research to compare the various
strategies on their psychometric and practical merits to provide the
data needed to guide a choice among strategies.



Test Length 

Any mental test has some criterion for test termination, that is, a
rule for stopping. Usually, a power test terminates when the examinee
has answered all the items (although a time limit may be imposed for
administrative convenience). Some adaptive testing strategies also use
fixed test length as a stopping rule: Terminate testing when the
examinee has answered some fixed number of items. Other strategies for
adaptive testing, however, allow test length to vary from one examinee
to another by basing the termination decision on some criterion other
than test length. For example, testing may be terminated when a ceiling
level of difficulty has been identified (e.g., Weiss's (1973) stratified
adaptive strategy), or when a prespecified degree of measurement
precision has apparently been attained (e.g., Urry, 1974; Samejima,
1977).

The research issue here concerns the relative merits of fixed 
length versus variable length adaptive tests. Is one alternative gen- 
erally preferable over the other or preferable for some testing purposes 
but not for others? The notion of variable length tests has some intui- 
tive appeal. Research is required to verify whether variable length 
tests have psychometric and practical merit. 



Test Entry Level 

Another aspect of the design of adaptive tests is test entry level:
the difficulty level of the first item(s) the examinee must answer. In
some cases there may be reliable information available prior to testing
that would justify the use of different starting points for different
examinees. For example, in a multitest battery, some subtests are
substantially intercorrelated; an examinee's score on an early subtest
may provide useful data for choosing entry level on a subsequent
subtest.

The use of differential entry levels may permit us to improve 
measurement accuracy or to achieve a given level of measurement accu- 
racy in even fewer items than an adaptive test that uses a fixed entry 
level. Research is needed to determine if these potential advantages 
of differential test entry level can be achieved. 
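
One way such prior information might be used is sketched below. The rule is an assumption made for illustration, not a procedure described in this report: the examinee's standardized score on the earlier subtest is multiplied by the correlation between the two subtests to predict standing on the later one, and the first item is the one whose difficulty lies nearest that prediction. Item difficulties are assumed to be on the same standardized scale; the names `entry_item`, `prior_z`, and `correlation` are hypothetical.

```python
def entry_item(prior_z, correlation, difficulties):
    """Index of the item to administer first on the later subtest.

    prior_z:      standardized score on the earlier subtest.
    correlation:  correlation between the two subtests.
    difficulties: item difficulties, assumed to be on the same z scale.
    """
    # Linear regression prediction of standing on the later subtest.
    predicted = correlation * prior_z
    return min(range(len(difficulties)),
               key=lambda i: abs(difficulties[i] - predicted))


difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(entry_item(2.0, 0.6, difficulties))   # high scorer on the earlier subtest
print(entry_item(2.0, 0.0, difficulties))   # uncorrelated subtests
```

When the correlation is zero the prediction is the scale mean for every examinee, so the rule reduces to a fixed entry level.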
Scoring Adaptive Tests



Because an adaptive test is fundamentally different from a
conventional test in which everyone answers the same questions, it
follows that conventional test scoring methods may not be applicable to
adaptive tests. That is, it may make little sense to score an adaptive
test by weighting and summing the dichotomous item scores. If so,
alternative scoring methods are needed, which gives rise to yet another
research issue: What means of scoring adaptive tests are available,
and which are "best" in some important sense?

A related issue is the comparability of scores on adaptive tests
with more familiar scores on standardized conventional tests. Are
appropriate score equating methods available for transforming adaptive
test scores into the metric of raw or converted scores of established
conventional measures of the same variables?

The Testing Medium

Conventional ability tests are typically administered via paper
and pencil, and constructed of multiple-choice items. Adaptive tests
using the same item types may be administered individually (a) by a
skilled examiner; (b) at an automated testing terminal, perhaps
controlled by a computer; or (c) by means of specially constructed
paper-and-pencil tests.

Individual testing by skilled examiners is impractical for
large-scale use. Thus, only automated testing terminals and specially
designed paper-and-pencil tests merit serious consideration as potential
media for adaptive testing on a large scale. Whether paper-and-pencil
adaptive testing is even feasible is problematic because of the
requirement for sequential item selection. Another research issue, then,
concerns the feasibility of group administration of paper-and-pencil
adaptive tests.

The feasibility of automated test administration is not in
question, since the presentation of test items and the recording and
processing of an examinee's responses can be done using modern computers
with interactive visual display terminals, such as teletype, cathode ray
tube (CRT), or plasma tube (PLATO) terminals.

Nevertheless, computers and computer terminals are presently
relatively expensive compared to traditional printed test booklets and
answer sheets. It may be preferable to base automated adaptive tests
on devices that are somewhat less sophisticated and less costly than
full-scale computer systems. Still another research issue surfaces
here: What alternative devices/systems may be used for automated
adaptive testing, and what are the advantages and disadvantages of
each?
Item Pool Development



Selecting the items to constitute an adaptive testing item pool is
a somewhat larger undertaking than choosing items for a conventional
test. The psychometric criteria for item selection and for pool
construction are more rigorous than those for conventional test design,
and the item pool must be substantially larger than the length of any
individualized test drawn from it. Since the degree to which an
adaptive test realizes its potential may be limited by the size and
quality of its item pool, it is imperative that research define the
necessary or desirable characteristics of item pools for adaptive
testing and provide practical prescriptions for item pool development.

Advances in Measurement Methodology

Adaptive administration of traditional dichotomously scored test
items promises a significant gain in the psychometric efficiency of
measurement. Since adaptive testing research has stressed the use of
computer terminals for test administration, we should exploit the
unique capabilities of computers to control test situations that are
vastly different from the relatively simple tasks that comprise
paper-and-pencil tests. New approaches to ability measurement may arise
from the conjunction of adaptive test design and computerized test
administration, and thus a number of research issues may arise. These
issues could include the following: How can the expanded stimulus and
response modes made possible by computer administration be exploited
to improve the measurement of traditional ability variables? What new
variables can be identified and measured using the computer's unique
capabilities? Are scaling techniques available that are appropriate
for those new measures? How does the utility of new measurement methods
compare with that of traditional testing?



The problems originally hindering the development and
implementation of adaptive testing were (a) psychometric and (b)
practical. The psychometric problems concerning adaptive tests included
the inappropriateness of classical test theory, the lack of
prescriptions for their design, the need for methods of scoring, and the
need for assessing the measurement properties. The practical problems
included the need to develop new media for administering adaptive tests
and the difficulty of assembling the large pools of test items demanded.
Each of these problems will be discussed below, followed by a brief
exposition of the state of the art relevant to solution of specific
problems.



THE STATE OF THE ART 
Psychometric Theory



Discussion 

Traditional, or classical, test theory is inadequate to deal with
some of the psychometric problems posed by adaptive tests. The problem
in classical test theory was to order persons with respect to an
individual differences variable on the basis of their number correct or
proportion correct on common or equivalent tests. The observed score
was assumed to differ from the "true score" by a random variable that
was uncorrelated with true score. In adaptive testing, different
persons respond to sets of test items that are in no sense equivalent
across persons. These individualized tests may differ in difficulty,
length, and the discriminating powers of their items. Obviously, the
number or proportion of correct scores is generally an inappropriate
index of individual differences; additionally, measurement error cannot
be assumed to be independent of the variable being measured. A test
theory was needed that could accommodate the special requirements of
adaptive tests.


Several solutions to this problem might be forthcoming. A class
of solutions currently exists in the body of latent trait mental test
theories, or item response theory. These "theories" are actually
statistical formulations that account for test item responses in terms
of the respondent's location on a scale of the attribute being measured
by the item. The best developed formulations to date deal with
dichotomous item responses as functions of a unidimensional attribute
variable.

In the language of ability and achievement testing, latent trait
methods treat the probability of a correct response to a test item as
a monotonic increasing function of the relevant underlying ability. When
a scale for the ability is established, the latent trait methods provide
mathematical models relating response probability to scale position.
These models are item trace lines, or item characteristic curves
(i.c.c.).

Once a scaling of the attribute has been accomplished and all the
item characteristic functions are known, the location of an individual
on the attribute continuum can be estimated statistically from the
dichotomously scored responses to any subset of the test items. Such an
estimate is a kind of "test score"; the advantage of using latent trait
methods for scoring is that all scores are expressed in the same metric,
regardless of the length or item composition of the test. Thus, within
the limits of the method, automatic equating of different tests can be
effected merely by using latent trait methods for scoring the tests.
This feature makes latent trait test theory an especially appropriate
basis for adaptive testing.

The prevailing trend in application of latent trait methods has
been to scale the measured attribute in such a way that all item
characteristic curves have the same functional form, differing from item
to item only in the parameters of the item characteristic functions.
Thus, once the general functional form has been established, each test
item can be completely characterized and differentiated from other test
items by the parameter(s) of its i.c.c. For attributes such as ability
and achievement variables, where item trace lines should be monotonic
in form, several similar response models have been developed in detail.
These include a one-parameter logistic ogive model due to Rasch (1960),
of which Wright (1968; Wright & Panchapakesan, 1969) has been a leading
proponent in this country; a two-parameter extension of the Rasch
model by Urry (1970); a slightly different two-parameter logistic ogive
model developed by Birnbaum (1968); a similar model based on the normal
ogive, developed by Lord (1952; Lord & Novick, 1968); and a
three-parameter logistic ogive model (Birnbaum, 1968). All of these
models express the probability of a correct (or keyed) response to a
dichotomously scored test item as an ogive function of attribute level.
Symbolically, this may be expressed

$$P_g(1 \mid A) = F(a_g, b_g, c_g; A). \tag{1}$$

The expression on the left of the equality is the probability of the
keyed (1) response to item g, given A, the attribute level.
$F(a_g, b_g, c_g; A)$ is a general mathematical function in the item
parameters $a_g$, $b_g$, and $c_g$ and the person parameter, attribute
level A. In the ogive models, F is an ogive function of the distance
$(b_g - A)$, a scale parameter $a_g$, and an asymptote parameter, $c_g$.
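
As an illustration of one possible form of F, the sketch below codes a three-parameter logistic ogive. The specific expression, c + (1 - c) / (1 + exp(-a(A - b))), is the common textbook form and is supplied here as an assumption; the scaling constant sometimes included in the exponent is omitted. The function name `icc` and the default parameter values are hypothetical.

```python
import math


def icc(A, a=1.0, b=0.0, c=0.0):
    """Probability of the keyed response at attribute level A.

    a: scale (discrimination) parameter
    b: location (difficulty) parameter
    c: lower asymptote, the probability of the keyed response for an
       examinee of very low attribute level
    Setting c = 0 gives a two-parameter logistic ogive; fixing a as well
    gives a one-parameter form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))


for level in (-2.0, 0.0, 2.0):
    print(level, round(icc(level, a=1.5, b=0.0, c=0.2), 3))
```

The curve rises monotonically with A, passes through the midpoint between c and 1 where A equals b, and approaches c as A becomes very low.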

Where more than one item is administered, the probability of any
pattern (v), or vector, of item scores may be calculated readily by
virtue of a local independence assumption. Thus

$$P(v \mid A) = \prod_{g=1}^{k} [P_g(1 \mid A)]^{u_g} [1 - P_g(1 \mid A)]^{1 - u_g}. \tag{2}$$

Here $P(v \mid A)$ is the probability of the pattern of item scores (1's
and 0's), given A; $u_g$ is the dichotomous score on item g; and the
product runs over the k items administered. From $P(v \mid A)$ we
may derive expressions for the likelihood of any given attribute level,
given the item response vector. This permits us to apply statistical
techniques to the estimation of A, if the response pattern, v, and the
item parameters are known (or estimated) beforehand. There are also
simple, nonstatistical techniques for combining item responses into
other indices of individual differences on the attribute. (See Lord,
1974a, for pertinent discussion.)
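
The step from the pattern probability to an estimate of A can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any program cited in this report: item characteristic curves are taken to be three-parameter logistic, item parameters are treated as known, and the estimate is found by a simple grid search for the attribute level that maximizes the log of the pattern probability. The names `icc`, `log_likelihood`, and `ml_estimate`, and the grid limits, are hypothetical.

```python
import math


def icc(A, a, b, c):
    """Assumed three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))


def log_likelihood(A, items, scores):
    """Log of the pattern probability under local independence.

    items:  list of (a, b, c) parameter triples, one per administered item.
    scores: list of dichotomous item scores (1 = keyed response, 0 = other).
    """
    total = 0.0
    for (a, b, c), u in zip(items, scores):
        p = icc(A, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def ml_estimate(items, scores, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of A on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda A: log_likelihood(A, items, scores))


# Two items of difficulty -1 and +1; the easier is passed, the harder failed.
items = [(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
print(ml_estimate(items, [1, 0]))
```

Because the estimate depends only on the parameters of the items actually administered, examinees who answered different items are still scored on the same scale. A pattern of all keyed (or all non-keyed) responses has no finite maximum, so in this sketch the estimate simply falls at the edge of the grid.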

Given that latent trait test theories in principle can satisfy
the special requirements of adaptive tests, it remains to explicate
such theories sufficiently to provide practical methods for estimating
the parameters of each test item's characteristic curve and for
estimating examinee location on the attribute scale.
State of the Art



Statistical methods for estimating item parameters and attribute
levels have been developed for all the ogive models mentioned above.
Computer programs for item parameter estimation are available
(commercially or by private arrangement) from sources listed in Table 1.
Most of these computer programs perform simultaneous estimation of
examinee "ability" and of the item parameters. The statistical
estimation techniques used by these programs range from simple
approximations in FORTAP (Baker & Martin, 1969) to maximum likelihood in
LOGIST (Wood, Wingersky, & Lord, 1976), FORTAP and BICAL (Wright & Mead,
1977), to Bayesian model estimation in OGIVEIA (Urry, 1976).



Table 1

Existing Computer Programs for Estimating Item Parameters
of Latent Trait Item Response Models

Response model             Program name          Available from
-------------------------  --------------------  ----------------------------
1-parameter logistic       BICAL                 Wright, U. of Chicago
(Rasch model)

2-parameter logistic       LOGOG                 R. D. Bock, U. of Chicago

2-parameter normal ogive   FORTAP                F. B. Baker, U. of Wisconsin
                           NORMOG                R. D. Bock, U. of Chicago

3-parameter logistic       LOGIST                F. M. Lord,
                                                 Educational Testing Service

3-parameter logistic       OGIVEIA or ANCILLES   V. W. Urry,
                                                 Office of Personnel Management



Item parameter estimation procedures generally entail simultaneous
estimation of a person's ability. The task of ability estimation (or
test scoring) in the context of adaptive testing is less demanding. All
item parameters have been estimated beforehand; what remains is to
estimate ability (or to score the tests in some other appropriate way)
from knowledge of the item responses and the item parameters. The state
of the art of scoring adaptive tests is outlined below.

To summarize, latent trait theories have been shown to provide
appropriate psychometric bases for adaptive testing (see Lord, 1974a;
Urry, 1977). These theories have been well explicated for application
to tests of unidimensional attributes, using dichotomously scored items.
Mathematical algorithms have been developed for scaling attribute
variables and for estimating item characteristic curve parameters and
examinee ability or attribute level. These algorithms have been
incorporated into computer programs that process raw item responses and
yield the desired parameter estimates. These computer programs are
available from their developers.

Generalizations of latent trait methods to measure unidimensional
variables by means of nondichotomous test items have also been
accomplished. Samejima (1969) presented methods for extending the normal
ogive response model to graded response items. She has since extended
it to apply to items having continuous responses (Samejima, 1973).
Bock (1972) developed equations for estimating item parameters and
individual ability from nominal category responses to polychotomous test
items. Although they have seen relatively few applications, Samejima's
and Bock's algorithms have been incorporated into available computer
programs. Using graded, polychotomous, or multinomial-response test
items has potential for appreciable gains in psychometric information
compared to the information in dichotomously scored items.

A further advance in latent trait item response models is the
extension of these models to handle multidimensional test items. Samejima
(1973) has begun work in this area, as has Sympson (1977).

The Design of Adaptive Tests

Discussion

Choosing an Adaptive Testing Strategy. An adaptive test is one
that tailors the test constitution to examinee ability or attribute
level; given this definition, we are confronted with the problem of
how to accomplish tailoring. This problem of individualized test design
can be brought into conceptual focus by considering that, given a
fixed large set of test items from which only a relatively small subset
is to be administered to an individual examinee, there exists a subset
that is optimal, in some sense, at any specified test length. The
items that constitute the optimal subset will vary as a function of
the individual's attribute level. The problem of adaptive test design
is that of selecting approximately optimal item subsets for each
individual examinee. Solutions to this problem are called strategies for
adaptive test design.

An adaptive testing strategy consists, minimally, of rules for
item selection and for test termination; a scoring procedure may also
be an integral part of some strategies. For comprehensive reviews of
a variety of adaptive test strategies, see Weiss (1974) or Weiss and
Betz (1973).

The term "information" here refers to information in the sense
presented by Birnbaum (1968) and discussed below.

The essential rationale for adaptive item selection involves
administering more difficult items following successful performance and
easier items following less successful performance. If the test is
item-sequential, this translates to selecting a harder item after a
correct item response, and an easier item following an incorrect
response. Choosing the appropriate difficulty increment is one aspect
of the design problem. Another central aspect is choosing the
criterion to be optimized.
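
The item-sequential rule just described can be sketched as a simple
up-and-down procedure. The sketch below is illustrative only: the
function names and the fixed difficulty increment (step) are assumptions,
and it tracks only the target difficulty, not the choice of an actual
item from a pool.

```python
def next_difficulty(current_b, correct, step):
    """Move to a harder item after a correct response, an easier one otherwise."""
    return current_b + step if correct else current_b - step


def run_staircase(responses, start_b=0.0, step=0.5):
    """Return the sequence of target difficulties for a list of responses."""
    b = start_b
    path = [b]
    for correct in responses:
        b = next_difficulty(b, correct, step)
        path.append(b)
    return path


# Two correct responses followed by one incorrect response.
print(run_staircase([True, True, False]))  # [0.0, 0.5, 1.0, 0.5]
```

A fixed step keeps the rule simple; a shrinking step would be one way to
address the difficulty increment problem noted above.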

The purpose of mental testing usually is to order examinees with
respect to their relative attribute status. To achieve this purpose,
it is necessary to be able to discriminate accurately between any two
examinees, no matter how close they are in terms of the attribute. The
required discriminability has implications for the traditional
difficulty index of the items to be chosen. Using dichotomous items on
which guessing is no factor, to discriminate best about a point, choose
test items for which the probability correct is .50 at the point in
question. If guessing is a factor, the optimal p-value will exceed
.5 by an amount that is a function of the effect of guessing. However,
if the available test items also differ with respect to discriminating
power, the latter also must enter into the determination of which item
discriminates best locally. The information function (Birnbaum, 1968)
of a test item provides a single numerical index by which test items
may be ordered with respect to their usefulness for discriminating at
a given point. In equation form, the information I in item g at
attribute level A is expressed as

\[ I_g(A) = \frac{\left[ \frac{\partial}{\partial A} P_g(1 \mid A) \right]^{2}}{P_g(1 \mid A) \, [1 - P_g(1 \mid A)]}. \tag{3} \]

That item is "best" for which the local value of Ig(A) is highest. For
a k-item test, the best subset of k items is the subset for which the
sum of the Ig(A) is locally highest. The implication for adaptive test
design is to choose items so as to maximize Ig(A) at all points A. This
maximization is the goal of adaptive test design. Adaptive testing
strategies may or may not explicitly seek to achieve this goal; and the
goal may be realized to a greater or lesser extent by the different
test strategies.
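
The following sketch shows how the item information function might be
computed and used to pick the locally most informative item. It assumes
a three-parameter logistic response function with scaling constant
D = 1.7; the function names and the small item pool are hypothetical and
serve only to illustrate the selection principle.

```python
import math


def p_and_slope(A, a, b, c, D=1.7):
    """Return P(1|A) and its derivative with respect to A (assumed 3PL form)."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (A - b)))
    p = c + (1.0 - c) * psi
    slope = D * a * (1.0 - c) * psi * (1.0 - psi)
    return p, slope


def item_information(A, a, b, c, D=1.7):
    """Squared slope of the response function divided by P(1 - P)."""
    p, slope = p_and_slope(A, a, b, c, D)
    return slope ** 2 / (p * (1.0 - p))


def best_item(A, pool, used):
    """Index of the unused item with the highest information at level A."""
    candidates = [g for g in range(len(pool)) if g not in used]
    return max(candidates, key=lambda g: item_information(A, *pool[g]))


# Hypothetical pool of (a, b, c) triples; no guessing in this example.
pool = [(1.0, -2.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(best_item(0.1, pool, used=set()))  # 1: the item located nearest A
```

With equal discrimination and no guessing, the most informative item is
simply the one whose difficulty is closest to the current attribute level,
which agrees with the .50 rule stated in the text.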



Analogous to the item information function are two others: the test
information function and the test score information function, both of
which index measurement precision as a function of attribute level.

Adaptive test strategies differ in a number of ways. One general
dimension of these differences is their item selection mode. Some
strategies arrange test items a priori by difficulty and discrimination
into a logical structure, such as a one- or two-dimensional matrix,
and select items sequentially according to examinee performance by
branching to a predetermined location in the structure and
administering the item(s) that reside in that location. Such strategies
may be called "mechanical" by virtue of their almost mechanical rules
for item selection. Examples of mechanical strategies include the simple
branching strategies; the stair-step or pyramidal method used by Bayroff
and Seeley (1967) and by Larkin and Weiss (1974) and described by Lord
(1974a); the flexilevel tailored test devised by Lord (1971a); the
simple two-stage strategy, investigated by Lord (1971b) and by Betz
and Weiss (1974); the stratified adaptive (STRADAPTIVE) procedure
proposed by Weiss (1973); and even the Robbins-Monro procedures
described by Lord (1971c; 1974a).

Distinguished from the mechanical, or branching, strategies are
adaptive strategies that use mathematical criteria for item selection.
Such strategies typically estimate the examinee's latent attribute
status after each item response, then choose the available item for
which some mathematical function of that estimate and of the item
parameters is maximized or minimized. Examples of mathematical
strategies include Owen's (1969, 1975) Bayesian sequential procedure, in
which a quadratic loss function is minimized; and Lord's (1977) maximum
likelihood strategy, in which the available item with the largest local
information function is chosen.

One of the clearest distinctions between mechanical and mathematical
strategies is that in the latter every unadministered test item is
potentially eligible for selection at any stage in the test, whereas in
a mechanical strategy only a small number of items (as few as two) are
eligible for selection at any given stage. Another obvious distinction
is that the mathematical strategies are appealing by virtue of their
elegance, whereas the virtue of the mechanical strategies is their
simplicity. In confronting the problem of choosing an adaptive strategy,
one first must choose between elegance and simplicity. Then, by electing
categorically either a mechanical or mathematical strategy, one
is faced with the further choice of a specific adaptive testing
strategy. The number of strategies proposed for use has proliferated
faster than have research results useful to guide the choice.

The Test Length Issue. Confounded with the problem of choosing
a testing strategy is the problem of test length. Like conventional
tests, adaptive tests may be short or long; unlike most conventional
tests, adaptive tests may adapt test length, as well as test design,
to the individual.

The notion of a variable-length test seems to make sense, since the
examiner can administer as few or as many items as necessary to measure
each individual with a specified degree of precision. Furthermore, it
is apparent that if measurement precision is to be held constant,
achieving that precision should require relatively few items for persons
whose attribute level is near the central tendency of the population,
and more items for persons located in the upper and lower extremes of
the attribute continuum. Roughly speaking, if precision is to be held
constant, the required adaptive test length should be a U-shaped
function of attribute level.

Among the proponents of variable-length adaptive tests are Samejima
(1977), Urry (1974, 1977a), and Weiss (1973). Weiss advocates the use of
a simple stopping rule based on identifying a "ceiling level" of
difficulty for each examinee in conjunction with the stratified adaptive
(STRADAPTIVE) strategy. Samejima (1977) proposed that test length be
varied such that a constant level of measurement precision (indexed by
the test information function) be achieved throughout a prespecified
range on the attribute scale. Urry (1974) espouses using variable test
length in conjunction with Owen's Bayesian sequential adaptive strategy
in such a way as to yield a prespecified level of the validity of the
test scores as a measure of the underlying attribute; the squared
validity may be interpreted as a reliability coefficient.

It should be pointed out that some adaptive testing strategies are
inherently fixed-length. Among these are the flexilevel, pyramidal, and
two-stage strategies. Others, like Weiss' and Owen's strategies, make
fixed length optional. The variable-length test termination criteria
espoused by Urry and Samejima can in principle be used with any adaptive
strategy, even the ones described above as inherently fixed-length.
Weiss' criterion for variable-length termination of the STRADAPTIVE
test, however, is somewhat restricted in applicability because it
requires a certain structure (stratification by difficulty) of the item
pool.

Given the intuitive appeal of variable test length, two problems
remain. One problem is to decide between variable versus fixed test
length and which of the available test termination criteria to adopt.
The other problem is to verify that the apparent advantages of variable
test length are realized in practice.

State of the Art 

Choosing an Adaptive Strategy. One of the first steps in
implementing a program of adaptive testing must be to choose an adaptive
testing strategy from among those available. This choice should be
an informed one, based on the results of research comparing the merits
of available methods. Very little research has been conducted along
these lines, however. Instead, most adaptive testing research has
concentrated on comparing the psychometric properties of specific
adaptive test strategies against the properties of otherwise comparable
conventional test designs. Weiss and Betz (1973) reviewed the results of
these comparisons.

By "validity" is meant the correlation between the test score (ability
estimate) and the underlying true ability. This correlation is
estimated from the Bayesian posterior variance under Owen's method
following each item response by an examinee.

Some live-testing research comparing adaptive strategies was
reported by Larkin and Weiss (1975). Only two strategies were compared,
however, and the results were equivocal. The only other data available
as a basis for comparing adaptive strategies are data resulting from
analytic studies of the properties of various strategies and from
model-sampling computer simulation studies of similar properties. Lord
(1970; 1971a, b, c) reported the results of analytic studies of several
adaptive strategies, but made no effort to compare them. The only
studies that directly compared several strategies were the simulation
studies of Vale (1975) and McBride (1976b).

Vale's study compared five leading strategies in terms of the level
and shape of the resulting test information functions; in other words,
in terms of relative measurement precision as a function of attribute
level. Vale's artificial data were based on a response model that did
not permit guessing. Further, he presented data only for 24-item
fixed-length tests. His results indicated that under the conditions
simulated, the Bayesian test strategy was superior in terms of the level
of measurement precision, whereas the stradaptive strategy was superior
in terms of measuring with constant precision at all levels of the
attribute. The other adaptive strategies compared (the flexilevel,
pyramidal, and two-stage strategies) all were inferior to the first two
in some way.

Vale's study simulated only the no-guessing situation and a single
test length and did not investigate mathematical strategies other than
the Bayesian one. McBride (1976b) extended Vale's results in a series
of simulation studies comparing the psychometric properties of two
mathematical and two leading mechanical strategies at six different
test lengths and under several realistic conditions, including the
presence of guessing. His results indicated that the two mathematical
strategies were generally superior to the mechanical ones, especially
at short test lengths (5 to 15 items), both in terms of test fidelity
(validity) and measurement precision. At moderate test lengths (20 to
30 items), the mathematical strategies were still superior, but their
advantages over the mechanical strategies were slight.

The two mathematical strategies were Owen's Bayesian sequential
one, and a variant of a maximum likelihood strategy proposed by Lord
(1977). Differences in results between the two were slight, but the
maximum likelihood strategy was judged superior in adaptive efficiency
(the degree to which the methods select the optimal subset of items at
a given test length) and also in several other respects.

McBride concluded that his data favored the maximum likelihood
strategy overall, but that the choice among the four strategies should
be influenced by other considerations. For example, the Bayesian
strategy was the best of the four, in terms of adaptive efficiency,
at very short test length (5 items) when all examinees began the test
at the same level of difficulty; at the longer test lengths (25 and
30 items), all four strategies had excellent measurement properties,
and any one of them could reasonably be chosen.

It is important to note that McBride's comparison studies were
carried out so that the correct test item parameters were known and
available when simulating each adaptive strategy. In live testing, of
course, only fallible estimates of the parameters of the item
characteristic curves are available. The use of fallible estimates
should introduce measurement errors over and above those entering into
McBride's data. It is possible that the effects of such errors could
alter some of the conclusions McBride reached concerning the order of
merit of the four strategies he evaluated. Research is needed extending
his findings to the case of fallibly estimated item parameters.

Vale's (1975) and McBride's (1976b) simulation studies are the
only ones available for comparing strategies. There is, however, a
sizable body of research results available for evaluating several
individual adaptive strategies against conventional tests. Urry and his
associates (Urry, 1971, 1974, 1977b; Jensema, 1972, 1974, 1977; Schmidt
& Gugel, 1975) have reported results of a comprehensive program of
computer simulation investigations of some psychometric properties of
Owen's Bayesian sequential adaptive test. Vale and Weiss (1975) report
in considerable detail the measurement properties of the stradaptive
strategy. Lord (1977) recently proposed the broad-range tailored
test (a maximum likelihood strategy) and reported some data relevant to
its psychometric properties. All of these investigations have utilized
model-sampling computer simulation methods to explore the behavior of
the various test strategies. All have also taken different lines of
approach and concentrated on different aspects of each strategy's
psychometric behavior, so that it is not possible to compare the
strategies on the basis of the available reported data.

Fixed-Length Versus Variable-Length Adaptive Tests. There has
been no systematic study of the relative merits of variable-length
versus fixed-length adaptive tests. Rather, researchers in this area
have tended to make an a priori choice between the two options and
leave the choice unquestioned. Working independently and motivated
by different considerations, Samejima (1976), Urry (1974), and Weiss
(1973) all chose in favor of variable length. Lord (1977), however,
opted for fixed length in proposing his broad-range tailored test.

Samejima (1977), working in the framework of a maximum likelihood
strategy, suggested that the test information function be estimated
for each individual after each item response. The test may be
terminated when the estimated value of the information function reaches
a prespecified level. The effect of using the test termination rule
[illegible] termination criteria would actually achieve [illegible]

[illegible] (Urry, 1974; Jensema, 1977; Schmidt & Gugel, 1975)
[illegible] for use with Owen's Bayesian sequential adaptive test
[illegible]. Under Owen's procedure, the posterior variance of the
distribution of the Bayes estimator is calculated following each
item response. That variance, which usually diminishes after each
item, is interpreted as the square of the standard error of
estimate of the examinee's attribute level. Thus, by terminating each
test when the calculated standard error reaches a prespecified
value, the standard error of estimation [illegible] can be controlled,
and consequently so can an index of reliability of the ability
estimates. Thus, Urry advocates a variable-length test termination
rule to ensure (approximately) that the adaptive test scores have a
prespecified level of correlation with the latent attribute being
measured.
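
A posterior variance stopping rule of this kind can be sketched as
follows. The sketch rests on an assumption that is not spelled out in the
text: that the squared correlation between the ability estimate and the
true attribute is approximated by one minus the ratio of posterior to
prior variance, with a unit-variance prior. The function names and the
max_items safeguard are hypothetical.

```python
import math


def fidelity_from_posterior_variance(post_var, prior_var=1.0):
    """Approximate correlation of estimate with true attribute (assumed relation)."""
    return math.sqrt(max(0.0, 1.0 - post_var / prior_var))


def variance_threshold(target_fidelity, prior_var=1.0):
    """Posterior variance at which the target correlation is nominally reached."""
    return prior_var * (1.0 - target_fidelity ** 2)


def should_stop(post_var, target_fidelity, items_given, max_items=30):
    """Terminate when the variance criterion is met or the item budget is spent."""
    reached = post_var <= variance_threshold(target_fidelity)
    return reached or items_given >= max_items


# A nominal target correlation of .95 implies a variance criterion near .0975.
print(variance_threshold(0.95))
```

The item budget is included because, with a finite item pool, the
variance criterion may never be reached for some examinees.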

Urry (1971, 1974) and Jensema (1977) have presented the results
of numerous simulation studies of Owen's procedure to show that the
fidelity coefficient of the test scores can be controlled by using the
posterior variance as a test termination rule. These studies all used
the true values of the simulated test items' parameters for item
selection and scoring. Schmidt and Gugel (1975) presented simulation
study data for the more practical case in which fallible item parameter
estimates are used. The effect of using fallible item parameters with
Owen's procedure was a tendency for the tests to terminate prematurely,
so that the obtained fidelity coefficients fell slightly below the
[illegible] values.






Subsequent to his computer simulation studies, Urry (1977b)
administered Bayesian adaptive tests of verbal ability to live
examinees. His analysis of the adaptive test data evaluated the
usefulness of the termination criterion for controlling the level of
"construct validity" (correlation of the resulting test scores with an
independent measure of the same ability). Urry found that for all the
evaluated levels of the criterion, the obtained validity coefficient
was equal to or slightly greater than the forecast validity
associated with each test termination criterion. He concluded that
the theory was supported: that it was possible to control the
reliability and validity of a test by using the Bayesian procedure and
manipulating the posterior variance termination criterion.

Urry and others have been successful in controlling adaptive test
validity/fidelity/reliability by manipulating test termination criteria.
That notwithstanding, they have not demonstrated that equiprecision
of measurement (a flat information function) could be achieved
[illegible] a variable test length procedure, nor have they attempted
[illegible]. McBride (197?), in simulation studies of the same
Bayesian procedure, found a strong positive correlation (value
illegible in source) between test length and ability when variable test
length was used; i.e., the termination criterion was satisfied in fewer
items for lower ability [illegible]. As a result, the information
functions of the simulated Bayesian variable-length tests tended to be
[illegible] in shape, with markedly low values in the low end of the
attribute range. McBride concluded that there may be a greater virtue in
fixed-length Bayesian adaptive tests.

It will be [illegible] that the issues involved in choosing
an adaptive testing strategy and in deciding between fixed and variable
length both remain unresolved. Additional research in both areas is
needed.

These unresolved issues need not impede progress in the experimental
implementation of systems for adaptive testing, because the
unresolved differences among the leading adaptive testing strategies are
undoubtedly of lesser magnitude than the differences between any such
strategy and a conventional test design. Perhaps recognizing this,
Urry (1977a) cautions sternly against procrastination in implementing
adaptive testing. It may be wiser to proceed by making a tentative
choice among the strategies and an arbitrary decision on the test
length issue, letting the academic world settle the remaining basic
research issues in due course.



Scoring Adaptive Tests



Discussion

For most adaptive test strategies, the traditional number correct
or proportion correct score will not suffice to index individual
differences on the attribute being measured. To understand this, consider
the goal of adaptive testing: to achieve equiprecision of measurement
across a wide range. The goal is achieved by fitting the test to the
examinee. Other things being equal, accomplishing that fit will result
in a flat regression of the proportion correct score on the attribute
scale. That is, test difficulty (as indexed by mean proportion correct)
will be approximately equal across a wide range of the attribute. As
a result, the proportion correct scores will have an information
function whose value is near zero throughout that wide range (e.g.,
McBride, 1975).

In practice, adaptive tests can be expected to fall somewhat short
of the goal of equiprecision, so that there may be some information in
traditional scoring methods. Nonetheless, for the most part the
proportion correct and similar indices are not adequate as general
scoring procedures for adaptive tests.

[illegible] that the number correct score would index. The flexilevel
strategy [illegible]. Let us consider [illegible] desiderata of an
adaptive test scoring procedure. [illegible] test [illegible],
susceptibility to guessing [illegible], and the difficulty level at
which the test was begun. [illegible] in all the parameters just
mentioned, so that a [illegible] not only for how many items a person
[illegible] but also which items were answered, and in some
[illegible]. It is [illegible] scoring procedure to make use of all the
information [illegible] examinee's answers, as well as in the identity
of the [illegible] the test.



Scoring methods based on latent trait theory are especially useful
[illegible] for scoring adaptive tests. This is because such methods
can take into account all relevant data in the constitution of an
individual's test, such as test length and item characteristic curve
parameters, as well as the item-by-item performance of the examinee.
Some fairly simple methods are available, along with others so complex
that they require a computer to perform needed calculations. The
problem of scoring adaptive tests is the problem of choosing (or
devising) an appropriate scoring method. Some of the available methods
are discussed below.



State of the Art



The number of scoring methods available for adaptive tests is
sizable. Some methods are general and are applicable under a variety of
testing strategies, while others were devised ad hoc and are specific
to one or a few strategies. Among the general methods we can
distinguish statistical procedures from nonstatistical ones.

Statistical Scoring Procedures. These procedures are based on
techniques of combining known psychometric information about the test
items with the observed item response performance of the examinee in
such a way as to yield a statistical estimate of the examinee's
location on the attribute scale. Although there are a host of such
estimation methods available, the ones most prominent in the literature
have been estimators based on the Rasch one-parameter logistic ogive
item response model, on the Birnbaum three-parameter logistic ogive
model, and on the three-parameter normal ogive model.

Under the Rasch model, the number correct score is a sufficient
statistic in the estimation procedure, provided that the Rasch
difficulty parameters of the items constituting an individual test are
known, there is no guessing, and all items are equidiscriminating. Least
squares estimators and maximum likelihood estimators of attribute level
[illegible]. The maximum likelihood estimator is somewhat more elegant
and more accurate. Estimators based on the Rasch model are not strictly
appropriate for scoring tests having known differences in item
discrimination parameters or on which there is a nonzero likelihood of
answering questions correctly by guessing. Urry (1970) has evaluated the
effect of ignoring guessing and item discriminating powers in scoring
adaptive tests [illegible]. That loss is reflected in the validity of
the adaptive test scores for measuring the relevant attribute. In sum,
where the Rasch model is appropriate, its use for scoring adaptive tests
is not [illegible]. Where it is inappropriate, a scoring procedure based
on a more general response model will extract more useful information
from adaptive test response protocols.

The more general item response models include two- and
three-parameter normal and logistic ogive models. The logistic models
can readily be made to approximate closely the normal models. Because of
their mathematical tractability, the logistic ogive models have largely
supplanted the normal ogive models in use. Further, the three-parameter
models are more general, of which the two-parameter ones are special
cases; similarly, the Rasch model is a special case of the
three-parameter logistic model. Thus, the three-parameter logistic model
is the model predominantly used in current practice.
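
How closely a logistic ogive can approximate the normal ogive may be
checked numerically. The sketch below assumes the commonly used scaling
constant D = 1.702, which is not stated in the text, and compares the two
curves over a grid of values; the function names are hypothetical.

```python
import math


def normal_ogive(x):
    """Cumulative normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic_ogive(x, D=1.702):
    """Logistic function rescaled by D (an assumed constant) to track the normal ogive."""
    return 1.0 / (1.0 + math.exp(-D * x))


# Largest absolute gap between the two curves on a grid from -6 to 6.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(normal_ogive(x) - logistic_ogive(x)) for x in grid)
print(max_gap)
```

Under this assumption the two curves differ by less than .01 everywhere
on the grid, which is small relative to the sampling error of typical
item parameter estimates.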

Test scoring (attribute estimation) under the three-parameter
logistic model usually has been accomplished using iterative maximum
likelihood estimation procedures. Such procedures use all the
information available in an examinee's dichotomous item scores on an
adaptive (or conventional) test: item difficulty, discrimination, and
guessing parameters; and the pattern of the examinee's right and wrong
answers. The likelihood equations used for this scoring method have
been derived and published (e.g., Jensema, 1972). Algorithms for
performing the estimation procedure have been incorporated in several
computer programs (e.g., see Urry, 1970; McBride, 1976b; Wood,
Wingersky, & Lord, 1976; Bejar & Weiss, 1979).
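
One way such an iterative procedure might look is sketched below. It
uses Fisher scoring (a Newton-type iteration with expected information),
which is an assumption of this example and not necessarily the algorithm
used in the programs cited. The function names, the bounds on the
estimate, and the step limit are likewise hypothetical.

```python
import math


def p_and_slope(A, a, b, c, D=1.7):
    """Return P(1|A) and its derivative with respect to A (3PL, assumed D)."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (A - b)))
    p = c + (1.0 - c) * psi
    return p, D * a * (1.0 - c) * psi * (1.0 - psi)


def ml_score(items, scores, start=0.0, max_iter=50, tol=1e-6):
    """Maximum likelihood estimate of attribute level by Fisher scoring."""
    if len(set(scores)) < 2:
        # All-correct or all-incorrect patterns have no finite ML estimate.
        raise ValueError("response pattern must contain both 0 and 1 scores")
    A = start
    for _ in range(max_iter):
        gradient = 0.0      # first derivative of the log likelihood
        information = 0.0   # test information at the current estimate
        for (a, b, c), u in zip(items, scores):
            p, slope = p_and_slope(A, a, b, c)
            gradient += slope * (u - p) / (p * (1.0 - p))
            information += slope ** 2 / (p * (1.0 - p))
        step = max(-1.0, min(1.0, gradient / information))  # limit step size
        A = max(-6.0, min(6.0, A + step))                   # keep A in bounds
        if abs(step) < tol:
            break
    return A


# Hypothetical two-item test, symmetric about zero, one right and one wrong.
print(ml_score([(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)], [1, 0], start=1.0))
```

The guard against uniform response patterns reflects a known limitation
of maximum likelihood scoring: the likelihood has no interior maximum
when every answer is right or every answer is wrong.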

Methods other than maximum likelihood may also be used for the
statistical estimation of attribute scale location. Sympson (1976),
for example, recently described two alternative methods, including a
general Bayesian one, for estimation under the three-parameter
logistic model.

For scoring an adaptive test using the Rasch model, the number correct
is not admissible as a test score, but rather as a sufficient statistic
for estimating ability; the resulting estimate is the test score.

There is one prominent application of the three-parameter normal
ogive response model to estimating examinee location on the attribute
scale: a Bayesian sequential procedure given by Owen (1969, 1975).
Owen's technique was presented as an integral part of his Bayesian
sequential adaptive testing strategy. It is just as appropriate for
scoring any test for which item parameters and dichotomous item
scores are available.

Both the maximum likelihood procedure and Owen's Bayesian sequential
procedure are methods of estimating an examinee's location on a
scale. There are substantial differences in approach between the
two, however. The maximum likelihood procedure estimates the examinee
location parameter from the pattern of an examinee's right and wrong
answers to his or her test questions, by solving a likelihood equation.
No prior assumptions are involved regarding the examinee's location
or the distribution of the attribute.



Owen's Bayesian procedure estimates examinee location sequentially.
It begins with an initial estimate of the location parameter and
updates that estimate, one item at a time, by solving equations that
consider both the likelihood function of the single item score and the
density function of an assumed normal distribution. The ability
estimate is the final updated value after the last item score is
considered.

Because it is a sequential procedure, Owen's scoring method is
order-dependent. Analyzing the same item responses in different orders
can result in slightly different numerical values of the final
estimates. The maximum likelihood scoring procedure is not dependent on
the order in which items are administered (or item responses analyzed).
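The order dependence can be illustrated with a simplified sequential
update. The sketch below is not Owen's closed-form procedure: it uses
the logistic rather than the normal ogive model, computes the posterior
mean and variance after each item by numerical integration, and then
treats the posterior as normal before the next item. That re-approximation
step is what makes the result depend on item order. All names and grid
settings are assumptions.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def sequential_update(items, scores, mean=0.0, var=1.0, n_grid=201, width=6.0):
    """Update a normal belief about examinee location one item at a time.

    items: list of (a, b, c); scores: list of 0/1 item scores.
    Returns the final (mean, variance).
    """
    for (a, b, c), u in zip(items, scores):
        sd = math.sqrt(var)
        # Evenly spaced points covering the current normal belief.
        grid = [mean + sd * width * (2.0 * k / (n_grid - 1) - 1.0)
                for k in range(n_grid)]
        weights = []
        for t in grid:
            prior = math.exp(-0.5 * ((t - mean) / sd) ** 2)
            p = p3pl(t, a, b, c)
            weights.append(prior * (p if u else 1.0 - p))
        total = sum(weights)
        new_mean = sum(w * t for w, t in zip(weights, grid)) / total
        new_var = sum(w * (t - new_mean) ** 2
                      for w, t in zip(weights, grid)) / total
        mean, var = new_mean, new_var  # posterior is re-approximated as normal
    return mean, var
```

Scoring the same responses in forward and reverse order with this
function generally gives estimates that agree closely but not exactly. A
single correct answer moves the estimate only part of the way from the
prior mean, which shows the regression-like behavior of a Bayesian
estimator.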

Another noteworthy difference between these two methods concerns
their statistical properties. Owen's Bayesian estimator behaves like
a regression estimate: extreme values are biased toward the initial
(prior) estimate, which is the mean of the normal Bayesian prior
distribution assumed for the location parameter. This bias may not be
linear, as McBride (1975) demonstrated, and may be undesirable for
applications (such as criterion-referenced testing) in which the
numerical accuracy of the estimator is of some consequence. Urry (1977a)
pointed out that the bias in the Bayesian estimates is readily
correctable using an ancillary method, but no data are available
concerning the efficacy of Urry's proposed correction. The maximum
likelihood estimator does not seem to be subject to the systematic bias
of Owen's Bayesian scoring method, but requires appreciably more
computer processing time and sometimes fails to converge to a
satisfactory estimate (McBride, 1975).

Sympson (1976) reported developing two alternative methods for
the examinee parameter estimation problem. One method is a Bayesian
method that considers the examinee's entire vector of item scores at
once and thus avoids the order-dependence of Owen's sequential scoring
method. It is also more general than Owen's method in that it is not
restricted to assuming a normal prior distribution on the latent
attribute. Instead, the user is free to specify any form for the
Bayesian prior distribution.
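The idea of a Bayesian estimate based on the whole response vector, with
a prior of any form, can be sketched as follows. This is not Sympson's
algorithm. The posterior mean is used as the point estimate, which is an
assumption (the text does not say which summary of the posterior is
taken), and the grid limits and function names are illustrative.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def bayes_estimate(items, scores, prior, lo=-4.0, hi=4.0, n_grid=161):
    """Posterior mean of examinee location given all item scores at once.

    prior: any non-negative function of theta (it need not be normal and
    need not integrate to one, since the posterior is renormalized).
    """
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + k * step for k in range(n_grid)]
    posterior = []
    for t in grid:
        likelihood = 1.0
        for (a, b, c), u in zip(items, scores):
            p = p3pl(t, a, b, c)
            likelihood *= p if u else 1.0 - p
        posterior.append(prior(t) * likelihood)
    total = sum(posterior)
    return sum(w * t for w, t in zip(posterior, grid)) / total


def normal_prior(t):
    """Standard normal density, up to a constant."""
    return math.exp(-0.5 * t * t)


def flat_prior(t):
    """Uniform prior over the grid."""
    return 1.0
```

Because the likelihood is a product over all items, reordering the items
changes the result only by floating-point rounding.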

Nonstatistical Scoring Procedures. The scoring methods discussed
above yield statistical estimates of an examinee's location on a scale.
Several less-sophisticated scoring methods are available that yield
numerical indices useful for ordering examinees. Such methods have the
advantage of computational simplicity, but lack the properties of
statistical estimators. Indices have been proposed for several different
adaptive testing strategies. Some of these indices are specific to the
strategies that gave rise to them, while others are generalizable to two
or more adaptive strategies. Weiss and Betz (1973) and Weiss (1974) have
discussed nonstatistical scoring methods in detail. Vale and Weiss
(1975) evaluated alternative methods against one another and found one
originally proposed by Lord to be generally superior to the others.

This index, called the "average difficulty score," is computed by
summing the item difficulty values of all test items answered by an
examinee and computing the average. The item difficulty values involved
are the difficulty parameters of the item characteristic curves, not the
traditional p-value difficulty indices.

The average difficulty score is appropriate for adaptive tests in
which all examinees begin testing at the same difficulty level. Although
it may be used in conjunction with tests having variable entry levels,
its properties have not been systematically investigated in such a
context. The weight given to the difficulty of the first item in a
variable entry level test may have the effect of biasing test scores in
the direction of the pretest estimate of the examinee's ability.

An alternative to the average difficulty score is to calculate
only the average difficulty of the items answered correctly; however,
test scores calculated in this fashion correlate almost perfectly with
the average difficulty of the items administered (Vale & Weiss, 1975).
Other nonstatistical scoring procedures evaluated to date have been
generally inferior to these two methods, even for scoring appropriate
types of adaptive tests; therefore, they will not be discussed here.
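The two nonstatistical indices just described amount to simple averages.
The sketch below assumes each administered item is represented by its
item characteristic curve difficulty parameter; the function names are
illustrative.

```python
def average_difficulty(difficulties):
    """Mean difficulty parameter of all items the examinee answered."""
    return sum(difficulties) / len(difficulties)


def average_difficulty_correct(difficulties, scores):
    """Mean difficulty parameter of the items answered correctly.

    Returns None when no item was answered correctly, since the index
    is undefined in that case.
    """
    correct = [b for b, u in zip(difficulties, scores) if u]
    if not correct:
        return None
    return sum(correct) / len(correct)
```

For an examinee who answered items of difficulty -0.5, 0.0, 0.5, and 1.0
and missed only the third, the first index is 0.25 and the second is
about 0.17.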

The Testing Medium

The adaptive test merits consideration as a possible replacement
for conventional standardized group tests. Therefore, the test
administration medium must be amenable to testing relatively large
numbers of examinees. There is a need to identify media that can meet
this requirement and to evaluate such media both absolutely and in a
comparative sense.



The media for administering adaptive tests fall into two
categories: specially designed paper-and-pencil tests and automated
testing terminals. A paper-and-pencil adaptive test superficially
resembles a conventional test, but requires the examinee to comprehend
and follow relatively complex instructions for the sequential choice
of test items. The complexity of the examinee's task in taking a
paper-and-pencil adaptive test may be excessive, particularly for lower
ability persons, with the result that the dimension to be measured is
confounded with the examinee's ability to follow instructions. If such
a confounding occurs to any substantial degree, the test may be an
invalid measure of the intended trait dimension. An obvious research
issue is to inventory the available techniques for administering
adaptive tests in the paper-and-pencil medium and to evaluate the
extent to which examinee task complexity is excessive.

Automated administration of an adaptive test relieves the examinee
of the burden of complying with the complex instructions; instead, the
testing device assumes this burden. This benefit is not achieved
without cost, however. Typically, automated tests have been
administered at interactive computer terminals, a medium currently more
expensive than paper-and-pencil administration. For adaptive
administration of tests composed of items like those in paper-and-pencil
group tests (typically, multiple-choice items), in principle, a device
much less sophisticated than a CRT computer terminal will suffice. Test
administration using such a device should be considerably less expensive
than the use of a computer. Clearly, the identification and design of
alternative devices for automated testing is an important issue for
research and development.



State of the Art

Paper-and-Pencil Adaptive Tests. Bayroff, Thomas, and Anderson
(1960) designed experimental paper-and-pencil branching tests based on
Krathwohl and Huyser's (1956) scheme for a "sequential item test," a
pyramidal adaptive strategy (Weiss, 1974). On subsequent administration
of branching tests of word knowledge and arithmetic reasoning,
respectively, Seeley, Morton, and Anderson (1962) found that 5% and 22%
of the examinees made critical errors in following the item branching
instructions. Such errors made those examinees' answer sheets
unscorable by the scoring method used; the tendency to such errors was
related to general ability.

Lord (1971a) devised the flexilevel testing method, an adaptive
strategy specifically intended for paper-and-pencil testing. Olivier
(1974) administered flexilevel tests of word knowledge to 635 high
school students and found that 17% of his examinees' tests were
unscorable because they had made critical errors in branching.

The Seeley et al. and Olivier (1974) experiences have generated an
air of pessimism about the feasibility of using the paper-and-pencil
medium for adaptive testing. This pessimism is based on two points:
(a) A substantial proportion of examinees tested has been unable to
follow the item-to-item branching instructions; (b) under the scoring
methods used, the tests of such examinees were unscorable. If the
concept of paper-and-pencil adaptive tests is to be viable, both
problems must be solved. That is, the complexity of the task must be
reduced, and scoring procedures must be developed that are amenable to
item branching errors.

The statistical scoring methods based on item characteristic curve
theory, discussed in another section, satisfy the latter requirement.
They provide a means of calculating a score, using a common metric, for
examinees who answered different sets of test items. These scoring
methods are applicable even to examinees who erred in item branching,
provided that it is known which items were answered and whether the
answers were right or wrong.

Since the use of item characteristic curve theory in effect solves
the scorability problem, all that remains to make paper-and-pencil
adaptive testing feasible is to minimize the problem of the complexity
of the branching task. This problem has not been solved to date,
although tentative approaches to its solution have been taken (e.g.,
McBride, 1978).

Perhaps the simplest solution proposed is the "self-tailored test"
suggested by Wright and Douglas (1975) for use with test items that
satisfy the Rasch simple logistic response model. Test items are printed
in the booklet in ascending order of difficulty. The examinee is
instructed to start answering test items at whatever difficulty level he
or she chooses and to stop where he or she chooses (or perhaps to answer
a fixed number of items). The test score (a Rasch ability estimate,
which can be determined by referring to a preprinted table) would be a
function of the difficulty levels of the easiest item answered and the
most difficult item answered, and the number of items answered correctly
in between.
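A table of this kind can be sketched as follows. Under the Rasch model
the number correct is a sufficient statistic, so the ability estimate
for a given set of answered items is the value at which the expected
number correct equals the observed number correct. The sketch takes the
difficulties of the answered items directly rather than only the easiest
and hardest; the bisection limits and function names are assumptions,
and this is not Wright and Douglas's own computing procedure.

```python
import math


def rasch_ability(difficulties, number_correct, lo=-6.0, hi=6.0, tol=1e-8):
    """Rasch ability estimate for a set of answered items.

    Solves sum_i P_i(theta) = number_correct by bisection, where
    P_i(theta) = 1 / (1 + exp(-(theta - b_i))). Returns None for zero
    or perfect scores, which have no finite estimate.
    """
    n = len(difficulties)
    if not 0 < number_correct < n:
        return None

    def expected_correct(theta):
        return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_correct(mid) < number_correct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def score_table(difficulties):
    """Lookup table from number correct to ability estimate."""
    return {r: rasch_ability(difficulties, r)
            for r in range(1, len(difficulties))}
```

For three items with difficulties -1, 0, and 1, `score_table` gives a
negative estimate for one correct answer and the mirror-image positive
estimate for two.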

The Wright and Douglas notion is appealing in its simplicity, but
it has drawbacks. First, its psychometric merits depend heavily on the
ability and willingness of the examinee to choose test items that are
most informative for his or her ability level: neither too difficult
nor too easy. Second, its linear branching rules and ability-estimation
procedures are not strictly appropriate where guessing is a factor and
where there is appreciable variability in the discriminating powers of
the test items. Nonetheless, this "self-tailored" testing scheme is
worthy of some exploratory research in settings where it is desirable
to reduce substantially the number of items each examinee must respond
to.

Where guessing is a factor and items vary appreciably in
discriminating power, the optimal choice of items in an adaptive test
is a function of those variables as well as of item difficulty. This
suggests that a somewhat more sophisticated rationale is required for
adaptive item branching than the simple linear progression implicit in
the Wright and Douglas proposal. Implementing a true item branching
procedure in a feasible paper-and-pencil version, without overbearing
complexity, may call for new approaches. The necessary approach is
to minimize the opportunity for error by making the branching
instructions as simple as possible and as few as possible.
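One common way to make item choice depend on discrimination and guessing
as well as difficulty is to pick the unused item with the greatest item
information at the current ability estimate. The sketch below uses the
standard information function for the three-parameter logistic model.
The report does not prescribe this rule here, so it should be read as
one possible rationale; the names and the constant D = 1.7 are
assumptions.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at theta under the three-parameter logistic model."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def most_informative(theta, pool, used=()):
    """Index of the unused item in pool with maximum information at theta.

    pool: list of (a, b, c) tuples; used: indices already administered.
    """
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))
```

With a pool of `[(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (2.0, 0.1, 0.2)]`
and an ability estimate of zero, the third item is chosen first because
its higher discrimination outweighs its slightly off-target difficulty.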

Simplicity may be achieved by using latent ink technology in
designing and printing answer sheets, thereby making the branching
instruction unambiguous and contingent only on what answer the examinee
gives to the item he or she is currently working on. The frequency of
item branching can be reduced by using a modified two-stage adaptive
strategy; the first stage might be a short branching test of several
items, while the second stage might be a multilevel test. The function
of the first stage test would be to route the examinee to an
appropriate level in the second stage. Each level would have the format
of a short conventional test; thus, no branching instructions need be
followed during the second stage. This notion was developed further
in a separate paper (McBride, 1978).

Automated Adaptive Testing. Most research on adaptive testing
has focused on computers as control devices and on computer terminals
as the medium for test administration. Although the computer is a
convenient and apt tool for automating testing, the relationship of
computers to adaptive tests is sufficient but not necessary. Any device
capable of storing and displaying test items, recording and scoring
responses, and branching sequentially from item to item can in
principle suffice as the testing medium. The computational power of a
computer may be highly desirable for implementing some adaptive testing
strategies, but it is far from necessary for all. Further, tests based
on dichotomously scored multiple-choice test items make such minuscule
demands on the capability of a modern computer that use of a computer
solely for administration of such tests seems wasteful. Simpler and
less costly devices can do the job, and such devices should be developed.
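How little computation a fixed branching strategy needs can be seen in a
sketch of a table-driven test. The branching table below is a made-up
three-stage pyramidal arrangement, and the item numbers and function
names are hypothetical. The only operations required are a lookup and a
right/wrong decision.

```python
# Hypothetical pyramidal branching table:
# item number -> (next item if correct, next item if wrong); None ends the test.
BRANCH_TABLE = {
    1: (3, 2),
    2: (5, 4),
    3: (6, 5),
    4: (None, None),
    5: (None, None),
    6: (None, None),
}


def administer(branch_table, first_item, answer_is_correct):
    """Walk the branching table and return the path of (item, correct) pairs.

    answer_is_correct: a function that presents an item and reports
    whether the examinee's response was right.
    """
    path = []
    item = first_item
    while item is not None:
        correct = bool(answer_is_correct(item))
        path.append((item, correct))
        next_if_correct, next_if_wrong = branch_table[item]
        item = next_if_correct if correct else next_if_wrong
    return path
```

An examinee who answers every item correctly follows the path 1, 3, 6.
The recorded path is all that a later scoring step would need.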

The first concrete effort to develop a simple device for automated
adaptive testing seems to have been one made at the Air Force Human
Resources Laboratory, Technical Training Division (AFHRL/TT).
Personnel there have developed a prototype programmable microprocessor
terminal for administering an adaptive test (Waters, personal
communication). The terminal itself resembles a hand-held desk
calculator, with an array of numbered keys used to respond to test
items. Its display device is a small array of several light-emitting
diodes (LEDs). The unit is preprogrammed to direct an examinee to
answer a response-contingent sequence of test questions that are
printed in a separate test booklet. After recording and scoring the
examinee's response to the current test item, the microprocessor unit
computes the location of the next item; the LED displays that location
as an item number; the examinee then turns to that item in the test
booklet and responds by keying in an answer on the keyboard. At test
termination, the examinee's protocol of identification data, item
responses, and test score can be "dumped" to a special-purpose computer
before the next examinee is tested. Development of the AFHRL prototype
is being undertaken by an independent contractor.

A further development of this prototype is contemplated by
AFHRL/TT. This would involve using the programmable microprocessor
both for item selection and for controlling the display on a peripheral
device of test items stored in microform: film slides, microfilm, or
microfiche. The contemplated device would emulate the function of a
full-scale computer terminal, but with limited interactive capability.
The significance of this step is that the examinee's role would be
limited to answering the sequence of displayed test items; the latter
would not have to participate in item selection or in locating test
questions.

In considering the state of the art with respect to automated
testing terminals, it is useful conceptually to consider the separate
components required of a test delivery device. These include the
following:

• Stimulus/display device

• Response device

• Item storage medium

• Internal processing
    - Response processing capability
    - Item selection capability
    - Test scoring capability
    - Data recording capability.

Display devices proposed or in use range in complexity from simple
printed matter, to microform readers, to computer graphics terminals.
Microform readers include microfilm reel readers, manual microfiche
readers, and automated magazine microfiche and ultrafiche readers. These
microform devices are capable of storing and displaying any test material
that can be printed and photographed, including graphic material. The
computer terminals amenable to automated testing include teletypes,
monochrome CRT terminals, plasma tube (PLATO) terminals, and color
graphics CRT terminals. Computer terminals typically have integral
provisions for response keyboards; microform display units do not. All
devices listed above are commercially available off the shelf; special
provisions may be required to integrate each into a testing system and
to interface each to a test control device.

With CRT or similar computer terminals, test item storage must be
in computer code, either core-resident or mass storage resident and
rapidly accessible. The volume of displayable material needed to
support a full battery of adaptive tests may require hundreds of
thousands of characters of computer storage.

Microform storage of test items is more efficient but less
flexible than computer storage. Items may be photographed and stored on
microfilm rolls, film slides, microfiche, or ultrafiche. Slides are
bulky and cumbersome and are worth considering only for a prototype.
Microfilm rolls are a highly efficient storage medium, but the
machinery needed to implement adaptive testing with items stored on
microfilm is complex and expensive. Microfiche and ultrafiche seem to
offer an acceptable compromise. A single 4-by-6-inch microfiche can
contain several hundred display images; an ultrafiche of similar
dimensions can hold images numbering in the thousands. Thus, test items
for a sizable battery of adaptive tests could be stored on about ten
microfiche or on a smaller number of ultrafiche. All that is required
for a test delivery device is the capability to automate the
microfiche/ultrafiche reader.

Automated microfiche readers are already commercially available
and can be modified readily to serve as testing terminals by
interfacing them to appropriate control devices.

The internal processing requirements of automated adaptive testing
may be accomplished by a central computer, minicomputer, or
microcomputer, entirely within today's state of the art. System design
stands between current developments and implementation of a
computerized system for adaptive testing.

Some efficiency or cost effectiveness may be gained by the use of
special-purpose microprocessors to control the test itself and the
testing peripherals. Again, such devices are well within the current
state of the art in electronics. The equipment needs to be designed and
integrated into a system for adaptive testing.



Item Pool Development



Discussion

Adaptive testing involves selective administration of a small
subset of a larger pool of items that measure the trait of interest.
The size of this item pool, along with the psychometric
characteristics of the constituent items, places limits on the
measurement properties of the adaptive test. Obviously, the item pool
should be large enough and constituted so as to permit the adaptive
tests to function effectively. Early theoretical research in adaptive
testing suggested that item pools had to be large, ranging from one or
two to several hundred or several thousand test items. More recently,
computer simulation research by Jensema (1977) and other associates of
Urry has shown that adaptive tests can function very well at test
lengths of 5 to 30 items and that item pools containing 50 to 200 items
are of sufficient size, provided that prescriptions for the
psychometric characteristics of the test items are met. These
prescriptions concern the magnitude of the items' item response model
discrimination parameters, the range and distribution of the item
difficulty parameters, and the susceptibility of the items to random
guessing.




Urry (1974) has listed such prescriptions for items calibrated
(with the three-parameter ogive model) against an ability scale on
which the examinee population is distributed normal (0,1). They
include item discrimination parameters exceeding .80, item guessing

j . . , . .. . j ,. , . »- ; * . .... ; i -- J ■: . i „ 

.: I- r a : : f gx i ?.\ i r i v f l ™* . r ■ ■ * - units >:i .; s t a: i. ia r. i a:. -via? Ion 

.^oalo. y.-jUr: ^l-'-'r? su ' i-'pjted an even wiJ-r i.r.ri-.; or iu-r a 1 ff i ■ "'i 1 ■ v 
and found that item pools with 100 and 150 items supported satisfactory
measurement properties in their adaptive tests. For measurements
focusing on the trait scale interval between -2 and +2 standard
deviations about the population mean, a 100-item pool seems sufficient
(e.g., Jensema, 1977; McBride, 1976b). For measurement over a wider
interval, a wider span of item difficulty is indicated, along with a
proportional increase in item pool size; see Lord (1977) and McBride
(1976b) for examples.

Because of the requisite size of item pools for adaptive testing
and the prescriptions concerning the needed psychometric characteristics
of the test items, a question of the feasibility of assembling adequate
item pools arises. Large numbers of test items used in conventional
tests will not meet the discrimination parameter criterion for inclusion
in adaptive test item pools. Furthermore, the wide, rectangular
distribution of item difficulty specified by Urry's prescription may be
difficult to satisfy. In many settings it may not be feasible to
construct adaptive test item pools from off-the-shelf test items.
However, where large-scale testing programs are already in progress,
the outlook is better. Urry (1974), for example, was able to assemble a
200-item pool for adaptive testing of verbal ability by screening about
700 items in 15 forms of a U.S. Civil Service test. Lord (1977) has made
available for research a pool of 690 verbal items from obsolete forms of
several tests published by the Educational Testing Service.
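Screening a bank of calibrated items against prescriptions of this kind
is a simple filtering task. In the sketch below the threshold values are
illustrative defaults supplied for the example, not figures quoted from
Urry's prescriptions, and the dictionary keys and function names are
assumptions.

```python
def screen_pool(items, min_a=0.80, max_c=0.30, b_range=(-2.0, 2.0)):
    """Keep items meeting assumed limits on discrimination (a),
    guessing (c), and difficulty (b).

    items: list of dicts with keys "a", "b", and "c".
    """
    lo, hi = b_range
    return [it for it in items
            if it["a"] >= min_a and it["c"] <= max_c and lo <= it["b"] <= hi]


def difficulty_counts(items, edges=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Count items per difficulty interval to check how nearly
    rectangular the difficulty distribution is."""
    counts = [0] * (len(edges) - 1)
    for it in items:
        for k in range(len(edges) - 1):
            last = k == len(edges) - 2
            if edges[k] <= it["b"] < edges[k + 1] or (last and it["b"] == edges[-1]):
                counts[k] += 1
                break
    return counts
```

Roughly equal counts across the difficulty intervals indicate that the
surviving pool approaches the rectangular distribution the prescription
calls for.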

In military testing, current and obsolete test batteries in the
aggregate contain hundreds of test items for each of several cognitive
ability variables that have been measured by military tests for several
years. For example, test variables such as word knowledge, arithmetic
reasoning, and general information have been included in Army selection
test batteries¹ through several generations of tests and multiple forms
within each generation. Such tests can be expected to contain, in their
various alternate forms, sufficient numbers of test items from which to
select the items to constitute item pools for adaptive testing.

¹ Examples include the Armed Forces Qualification Test (AFQT), the Army
Classification Battery (ACB), and the current Armed Services Vocational
Aptitude Battery (ASVAB).

For test variables not having a large bank of items already in
existence, a major item-writing/item-pool development program will be
necessary. Even for variables already well represented in large
numbers of test items, other problems remain to be solved before the
adaptive test item pools can be assembled. Foremost among these is item
calibration: estimation of the parameters of each item's characteristic
curve.

The data required for such calibration are the responses of a sample
of examinees to a set of test items. Urry (1977a) and Schmidt and Gugel
(1975) have reported research results that suggest that the number of
examinees should equal or exceed 2,000 in order to achieve accurate
estimates of item parameters for a three-parameter item response model.
Smaller numbers will suffice for the simpler but less general one- and
two-parameter response models. The important point is that errors of
parameter estimation will increase as either of the two sample sizes,
items and persons, decreases.

In calibrating the test items of large-scale testing programs,
such as the ASVAB, access to adequately large examinee samples is no
problem, since hundreds of thousands of examinees take each form of a
battery annually. However, the item sample sizes are inadequate by
Urry's standards. For example, the longest subtest in the current ASVAB
is only 30 items. Most ASVAB subtests are shorter. If accurate item
calibration is not possible using the existing answer sheets from such
subtests, then item calibration studies will need to include
administration of longer subtests to large numbers of examinees in a
testing program separate from current operational testing. On the other
hand, if a means can be found that will permit accurate item
calibration based on item responses to current subtests, there will be
a substantial reduction in the expense and effort required to develop
adaptive testing item pools.


State of the Art. For estimating item parameters under a three-
parameter response model, two existing computer programs are
appropriate: OGIVEIA, described by Urry (1977a); and LOGIST, described
by Lord (1974b). Item calibration research based on OGIVEIA led Urry to
prescribe test lengths of 60 items and examinee samples of 2,000 as the
minimum values for satisfactory parameter estimation. Lord (1974b)
recommended a similar examinee sample size, but made no mention of the
requisite test length.

The OGIVEIA program is appropriate for calibrating dichotomously
scored items only; no provision is made for item scores other than
right or wrong. Further, it explicitly assumes a normal distribution of
the ability parameter. LOGIST contains explicit provision for
differentiating unanswered items from those answered incorrectly. It
treats differentially two categories of unanswered items: items reached
but omitted and items not reached. Items not reached are ignored during
the portion of the item calibration process in which an examinee's
ability parameter is estimated. Lord (1974b) has suggested that this
feature of LOGIST may be useful for calibrating sets of test items in
which not all examinees answer the same items. Thus it may be possible
to use LOGIST to calibrate simultaneously items from two or more
alternate forms, where a different examinee sample responds to each
form. LOGIST makes no assumptions regarding the form of the
distribution of ability.

Two research questions need to be addressed if adaptive test item
pools are to be constructed from existing test items. First, what
are the effects of calibrating test items from the answer sheets of
rather short tests (20 to 30 items)? Second, if these effects are not
favorable, is it feasible to calibrate items by pooling answer sheets
from two or more tests taken by the same examinees, to increase
the number of items to a size needed for satisfactory calibration?
These questions are not readily amenable to answers based on
theoretical or mathematical analysis. However, they may be answered
empirically by means of simulated calibration of artificial item
response data along lines used by Lord (1975b) or by Schmidt and Gugel
(1975).

A related issue is one of equating the scales derived from
independent calibrations of test items measuring a common variable but
contained in different tests. This is the same problem as making item
parameter estimates that result from calibration of different tests in
different examinee samples all have reference to the same ability
metric. Lord (1975a) has suggested a number of equating methods, based
on item characteristic curve theory, that are applicable to this
problem. Some of these equating methods have distinct advantages over
traditional equating methods.
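The nature of the problem can be illustrated with one standard linear
linking technique, often called the mean and sigma approach. It is
offered only as an example and is not claimed to be among the methods
Lord (1975a) proposed. The sketch assumes that a set of common items has
difficulty estimates on both scales; the function names are
illustrative.

```python
import statistics


def mean_sigma_constants(b_from, b_to):
    """Slope A and intercept B of the linear map theta_to = A * theta_from + B.

    b_from, b_to: difficulty estimates of the same common items on the
    source scale and on the target scale.
    """
    slope = statistics.pstdev(b_to) / statistics.pstdev(b_from)
    intercept = statistics.fmean(b_to) - slope * statistics.fmean(b_from)
    return slope, intercept


def rescale_item(a, b, c, slope, intercept):
    """Express one item's parameters on the target scale.

    Difficulty moves with the ability scale, discrimination is divided
    by the slope, and the guessing parameter is unchanged.
    """
    return a / slope, slope * b + intercept, c
```

Once the constants are known, every item calibrated on the source scale
can be carried to the target scale with `rescale_item`, so that items
from separately calibrated tests refer to a single ability metric.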



Advances in Measurement Methodology

Discussion

Current methods of measuring psychological traits overwhelmingly
use tests composed of dichotomously scored items. In ability
measurement, each such item is a task, chosen from the domain of
relevant tasks, that an examinee performs successfully or
unsuccessfully, correctly or incorrectly. Performance on each item task
is taken as an indication of the examinee's level of functioning on an
underlying ability trait. Thus, the trait is only indirectly measured,
using item tasks that have only imperfect fidelity to the trait of
interest. For example, multiple-choice vocabulary test items often are
used to measure verbal ability.

Most adaptive testing research has used the same kinds of items.
Adaptive testing using traditional item types represents an improvement
in the efficiency of measurement but no improvement in the fidelity of
the test behavior to the trait of interest.

The usual media of group test administration, paper-and-pencil
booklets and answer sheets, necessitated the compromise of task
fidelity. Administration of tests by computer terminals, as is common
in adaptive testing research, opens up the possibility of introducing
whole new modes of stimulus and response to the methodology of
measuring psychological abilities and perhaps of improving the fidelity
between tests and abilities. The implications of computerized test
administration for measurement are potentially vast, as is the number
of research issues.

The basic issue is this: How can the capability of the computer
be exploited to yield more and better test information about individual
examinees? This subsumes other questions, such as: Can test stimuli
be enriched, and/or response modes expanded, to achieve improved
measures of current ability variables? Can nontraditional ability variables
be identified and measured, yielding improvements in test fidelity and
validity? Can advances in measurement procedures be made that are
accompanied by advances in practical utility?



State of the Art

A comprehensive review of the current status of research on these
issues is beyond the scope of this paper. Only a cursory overview will
be attempted.

For measuring traditional ability variables, expanded stimulus and
response modes are made possible by computer administration. On the
response side, several different approaches are possible. One is to
permit on-line polychotomous scoring rather than dichotomous scoring of
traditional multiple-choice type items; Samejima (1969) and Bock (1972)
have developed psychometric procedures to support such item scoring
methods. A more sophisticated approach is to accept natural language,
or free responses, to traditional test item stimuli; the examinees
could type their answers in full on a typewriter-like keyboard rather
than choose multiple-choice answers. Natural language processing
computer programs would be used to check free-form responses against the
nominal correct answers and thus to score item performance (see, for
example, Vale & Weiss, 1977).
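
To make the contrast between dichotomous and polychotomous scoring concrete, the sketch below computes option-by-option response probabilities under a multinomial logistic model of the kind associated with Bock (1972), in which each response option has its own slope and intercept. It is a minimal sketch, not a reproduction of any published procedure; the function name and all parameter values are hypothetical.

```python
import math


def nominal_category_probs(theta, slopes, intercepts):
    """Probability of choosing each option at ability theta.

    P_k(theta) = exp(a_k * theta + c_k) / sum_h exp(a_h * theta + c_h)
    slopes (a_k) and intercepts (c_k) are hypothetical option parameters.
    """
    if len(slopes) != len(intercepts) or not slopes:
        raise ValueError("need one slope and one intercept per option")
    logits = [a * theta + c for a, c in zip(slopes, intercepts)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-option item; option 0 is the keyed (correct) answer.
slopes = [1.0, -0.2, -0.3, -0.5]
intercepts = [0.0, 0.3, 0.0, -0.3]

for theta in (-3.0, 0.0, 3.0):
    probs = nominal_category_probs(theta, slopes, intercepts)
    print(theta, [round(p, 3) for p in probs])
```

Dichotomous scoring records only whether option 0 was chosen. Polychotomous scoring also uses which wrong option was chosen; in this illustration the distractors differ in how strongly they attract low-ability examinees, and it is that difference the additional scoring categories can exploit.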

Traditional test stimuli are static and usually monochrome; this
is necessitated by the printed medium in use. Presenting stimuli at
computer terminals makes it possible to introduce multicolored stimuli
and to use dynamic test items. For example, the examinee may be
permitted to "rotate" in space a three-dimensional figure presented on a
CRT screen to facilitate visualization. Cory (1978) has experimented
with the use of fragmentary pictures as test item stimuli, with the
examinee able to increment the proportion of the picture presented.

Computer administration has been suggested as a means of measuring
ability variables not convenient to test in paper-and-pencil format
(Weiss, 1975). This will permit test designers and users to transcend
the limits of traditional ability tests that measure verbal ability
and logical, sequential analytical functions associated with the left
hemisphere of the brain. Spatial perception, short-term memory,
judgment, integration of complex stimuli, cognitive information-processing,
and other complex abilities may be measurable by exploiting the power
and flexibility of the computer terminal as a testing medium. Cory
(1978) has conducted exploratory research investigating computer
administration of some novel item types. Valentine (1977) has discussed
preliminary efforts directed toward computerized assessment of certain
psychomotor abilities. Rimland and his associates (Lewis, Rimland, &
Callaway, 1977) have used a computer to facilitate measurements of
brain activity that may be related to ability variables. Rose (1978)
is investigating measures of cognitive information processing skills
using dynamic computer-administered problems as test items. All of the
efforts just listed have shown some promise, but they must be considered
as exploratory efforts that may or may not lead to developments that
supplant or complement traditional methods of measuring psychological
abilities.